5. Gaussian Processes

\[ \newcommand{\bs}[1]{\mathbf{#1}} \]

  • motivating case study

  • GP regression

  • kernels

Learning outcomes

  1. Describe the theoretical foundation of intrinsically interpretable models like sparse regression, Gaussian processes, and classification and regression trees, and apply them to realistic case studies with appropriate validation checks.

Reading

  • Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017

  • Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.

Motivating Case Study

Scientific Goal

Question: How quickly does a star rotate around its axis? This has to be answered accurately before we can search for exoplanets around the star.

Data: Telescopes make it possible to measure the brightness of a star over time. We can look for periodic patterns in the brightness.

Intuition

Ideally the brightness would be a perfect sine wave. But brightness changes in more complex ways, both on the surface of the star and in the space between the star and the telescope.

Data

How does flux vary as the star rotates?

Statistical Motivation

Formulation.

  • Period \(P\) ≈ rotation period of star
  • Variations related to evolving spots

Goal. Infer \(P\) to understand stellar activity and improve exoplanet detection.

Modeling. We need a statistical model of functions that:

  • Allows for nonlinearity.
  • Allows interpretation of quantities like \(P\).

Gaussian Process Regression

Gaussian Process Prior

Definition. A Gaussian Process (GP) is a collection of random variables, any finite subset of which is jointly Gaussian:

\[f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))\]

This can be used to define a prior distribution over functions. \[\mathbf{f} = (f(x_1), \ldots, f(x_N))^\top \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})\]

where \(\mu_i = m(x_i)\) and \(K_{ij} = k(x_i, x_j)\). We will set \(m(\mathbf{x}) = 0\).

Function space perspective

Classical statistics: Estimate fixed-dimensional parameters \(\theta\) of a function.

GPs: Estimate the entire function \(f\).

Observation This seems impossible, since functions are infinite-dimensional. But we only observe \(f\) at finitely many points, \[\begin{align*} \mathbf{f} = \left(f\left(x_1\right), \dots, f(x_M)\right) \end{align*}\] and a GP gives us a prior for any choice of \(x_{1}, \dots, x_{M}\).

This is what it means to define a “distribution over functions.”

Prior Covariance Structure

Covariance matrix \(\mathbf{K}\):

  • Diagonal: \(K_{ii} = k(x_i, x_i)\) is the prior variance at each point (equal to \(\sigma_f^2\) for the stationary kernels introduced below).
  • Off-diagonal: \(K_{ij}\) is the covariance between \(f(x_i)\) and \(f(x_j)\).

The kernel encodes our assumptions about function smoothness, periodicity, trends, etc.

Sampling from the Prior

  1. Choose \(M\) test locations \(\mathbf{x}^* = (x_1^*, \ldots, x_M^*)^\top\)
  2. Compute \(\mathbf{K}_{**} = [k(x_i^*, x_j^*)]_{i,j=1}^M \in \mathbf{R}^{M \times M}\)
  3. Sample \(\mathbf{f}^* = (f(x_1^*), \ldots, f(x_M^*)) \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_{**})\)

Each sample is one plausible function from our prior.

Different kernels \(\to\) different classes of functions.
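
Below is a minimal base R sketch of this three-step recipe, assuming an RBF kernel with \(\sigma_f = 1\) and \(\ell = 1\) (the kernel choice and the jitter term are illustrative, not prescribed by the slides):

# Step 1: choose M test locations
x_star <- seq(0, 10, by = 0.1)

# Step 2: compute the kernel matrix K_** under an RBF kernel
K_ss <- exp(-outer(x_star, x_star, "-")^2 / 2)

# Step 3: draw f* ~ N(0, K_**); the jitter keeps the Cholesky stable
U <- chol(K_ss + 1e-8 * diag(length(x_star)))
f_star <- as.vector(t(U) %*% rnorm(length(x_star)))
plot(x_star, f_star, type = "l")

Each run of the last two lines produces a different plausible function from the prior.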

Adding Observations

We have defined a prior over functions. Now: update beliefs given data \(\{x_i, y_i\}_{i=1}^N\).

 

Prior \[f \sim \mathcal{GP}(0, k(\mathbf{x}, \mathbf{x'}))\]

Observation model \[y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2) \text{ independent}\]

Joint Distribution

Let \(\mathbf{x} = (x_1, \ldots, x_N)\) denote training locations. Write \(\mathbf{y} = (y_1, \ldots, y_N)\).

Claim: The joint distribution is \[\begin{bmatrix} \mathbf{f}_* \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}_{**} & \mathbf{K}_{*} \\ \mathbf{K}_{*}^\top & \mathbf{K} + \sigma_n^2\mathbf{I} \end{bmatrix}\right)\]

Notation: Kernel matrices are

  • \(\mathbf{K} := [k(x_i, x_j)]_{i,j=1}^N \in \mathbf{R}^{N \times N}\)
  • \(\mathbf{K}_* := [k(x_i^*, x_j)]_{i=1,j=1}^{M,N} \in \mathbf{R}^{M \times N}\)


Justification

By definition of GP: \[\begin{bmatrix} \mathbf{f}_* \\ \mathbf{f} \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}_{**} & \mathbf{K}_* \\ \mathbf{K}_*^\top & \mathbf{K}\end{bmatrix}\right)\] where \(\mathbf{f} = (f(x_1), \ldots, f(x_N))\).

Since \(\mathbf{y} = \mathbf{f} + \boldsymbol{\epsilon}\) is a sum of independent Gaussians, it is Gaussian. Using independence of \(\mathbf{f}\) and \(\boldsymbol{\epsilon}\):

\[\text{Cov}(\mathbf{f}_*, \mathbf{y}) = \text{Cov}(\mathbf{f}_*, \mathbf{f}) = \mathbf{K}_*\]

\[\text{Cov}(\mathbf{y}, \mathbf{y}) = \text{Cov}(\mathbf{f} + \boldsymbol{\epsilon}, \mathbf{f} + \boldsymbol{\epsilon}) = \mathbf{K} + \sigma_n^2\mathbf{I}\]

Gaussian Process Posterior

Now we condition on observed \(\mathbf{y}\) to get a posterior for the \(f\) that generated the data. The Gaussian conditioning formula gives

\[p(\mathbf{f}^* | \mathbf{y}, \mathbf{x}, \mathbf{x}^*) = \mathcal{N}(\boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*)\] where

\[\begin{align*} \boldsymbol{\mu}_* &= \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y}\\ \boldsymbol{\Sigma}_* &= \mathbf{K}_{**} - \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{K}_{*}^\top \end{align*}\]

The next few slides interpret this formula.
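
As a concrete reference, here is a minimal base R sketch of these two formulas, assuming an RBF kernel; the helper names and the default noise level \(\sigma_n = 0.1\) are our own illustrative choices:

# RBF kernel matrix between two sets of 1-d locations
k_rbf <- function(a, b, ell = 1) exp(-outer(a, b, "-")^2 / (2 * ell^2))

# Posterior mean and covariance at test locations x_star
gp_posterior <- function(x, y, x_star, sigma_n = 0.1) {
  Ky <- k_rbf(x, x) + sigma_n^2 * diag(length(x))   # K + sigma_n^2 I
  K_star <- k_rbf(x_star, x)                        # M x N cross-covariance
  list(mean = K_star %*% solve(Ky, y),
       cov  = k_rbf(x_star, x_star) - K_star %*% solve(Ky, t(K_star)))
}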

Posterior Mean

\[\begin{align*} \boldsymbol{\mu}_*=\mathbf{K}_{*}\left(\mathbf{K}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y} \end{align*}\]

  • \(\mathbf{K}_{*}\) \(\to\) how strongly each test point correlates with the training data
  • \(\left(\mathbf{K}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y}\) \(\to\) reweights the training data according to measurement precision

Posterior Mean

\[\begin{align*} \boldsymbol{\mu}_*=\mathbf{K}_{*}\left(\mathbf{K}+\sigma_n^2 \mathbf{I}\right)^{-1} \mathbf{y} \end{align*}\]

If \(f^\ast\) and \(y\) were both scalars, this would simplify to \[\begin{align*} \mu_{f^\ast \vert y} = \frac{\operatorname{cov}\left(f^*, y\right)}{\sigma_y^{2}}y = \rho \frac{\sigma_{f^\ast}}{\sigma_y}y \end{align*}\]

  • \(\rho = 1\) \(\to\) copy \(y\) and map it to the scale of \(f^*\).
  • \(\rho = 0\) \(\to\) ignore \(y\) and stay at the prior mean 0.
  • \(\rho = -1\) \(\to\) copy \(y\), flip its sign, and map it to the scale of \(f^*\).

Posterior Variance

\[\begin{align*} \boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_{*}(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{K}_{*}^\top \end{align*}\]

  • We reduce uncertainty when we observe data
  • If \(x^{\ast}\) is far from the data, then \(\mathbf{K}_{\ast}\) is small and we revert to the prior covariance \(\mathbf{K}_{\ast\ast}\).

Discussion

Discuss [GP Conditioning] part (a) in the exercise sheet. Self-assign groups – only one response per group needed.

Consider a GP with zero mean function and RBF kernel \(k(x, x') = \exp(-\|x - x'\|^2)\). You have two test points \(\mathbf{x}^* = [0, 2]^\top\) at which you want predictions, but no training data.

  1. What are the mean vector and covariance matrix of the prior distribution \(p(\mathbf{f}_{*})\) at the test points?

Kernels

Linear Regression and Similarity

Linear regression makes predictions using

\[\begin{align*} \mathbb{E}[y | \mathbf{x}] = \mathbf{x}^\top \boldsymbol{\beta}. \end{align*}\]

Geometrically, predictions at nearby points should be similar.

What does “nearby” mean? In linear regression, \[\begin{align*} \mathbf{x}^\top \mathbf{x}' \end{align*}\] measures closeness.

Justification

Suppose \(\beta \sim \mathcal{N}\left(0, \sigma^2 I\right)\). Then,

\[\begin{align*} \text{Cov}(y, y') &= \mathbb{E}[\mathbf{x}^\top \beta \beta^\top \mathbf{x}'] \\ &= \mathbf{x}^\top \mathbb{E}[\beta \beta^\top] \mathbf{x}' \\ &= \mathbf{x}^\top (\sigma^2 \mathbf{I}) \mathbf{x}' \\ &= \sigma^2 \mathbf{x}^\top \mathbf{x}' \end{align*}\]

Covariance between predictions depends on \(\mathbf{x}^\top \mathbf{x}'\).
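
A quick Monte Carlo check of this identity, using made-up inputs (the specific vectors and number of draws are arbitrary):

set.seed(1)
x1 <- c(1, 2); x2 <- c(0.5, -1)
betas <- matrix(rnorm(2e5), ncol = 2)   # beta ~ N(0, I), so sigma^2 = 1
cov(betas %*% x1, betas %*% x2)         # sample covariance of predictions
sum(x1 * x2)                            # x^T x' = -1.5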

Linear Regression and Similarity

Kernel answer: Define closeness explicitly.

\[\begin{align*} k(\mathbf{x}, \mathbf{x}') = \text{how much } f\left(\mathbf{x}\right) \text{ and } f\left(\mathbf{x'}\right) \text{should covary} \end{align*}\]

The function \(k\) is called a “kernel”.

Radial Basis Function (RBF) Kernel

\[k_{\text{RBF}}(\bs{x}, \bs{x}') = \sigma_f^2 \exp\left(-\frac{\|\bs{x} - \bs{x}'\|^2}{2\ell^2}\right)\]

  • At distance \(\|\bs{x} - \bs{x}'\| = 0\): perfect correlation
  • At distance \(\|\bs{x} - \bs{x}'\| = \ell\): the covariance decays to \(\sigma_{f}^2 e^{-\frac{1}{2}} \approx 0.61\,\sigma_{f}^2\)
  • At distance \(\|\bs{x} - \bs{x}'\| = 3\ell\): nearly zero (see the numeric check after this list)
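
A one-line numeric check of these correlation values (the helper rho() is our own):

rho <- function(tau, ell = 1) exp(-tau^2 / (2 * ell^2))
round(rho(c(0, 1, 3)), 3)   # 1.000 0.607 0.011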

Lengthscale Interpretation

\[\text{Correlation} = \exp\left(-\frac{\tau^2}{2\ell^2}\right), \quad \tau = \|\bs{x} - \bs{x}'\|\]

Small lengthscale \(\ell\) (relative to the data range):

  • Allows rapid oscillation
  • Can fit data with high-frequency variation

Large lengthscale \(\ell\):

  • Enforces slower variation
  • Smooths over local fluctuations


Periodic Kernel

\[k_{\text{per}}(\bs{x}, \bs{x}') = \sigma_f^2 \exp\left(-\frac{2\sin^2(\pi\|\bs{x}-\bs{x}'\|/p)}{\ell^2}\right)\]

Parameters:

  • \(p\): period
  • \(\ell\): lengthscale (similar to RBF)
  • \(\sigma_f^2\): signal variance

This will help us model stellar rotation.
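
To see the periodicity numerically, here is a small sketch with illustrative defaults \(p = 1\), \(\ell = 1\), \(\sigma_f = 1\) (the helper name is ours); the covariance returns to its maximum whenever the distance is a multiple of the period:

k_periodic <- function(tau, p = 1, ell = 1, sigma_f = 1)
  sigma_f^2 * exp(-2 * sin(pi * tau / p)^2 / ell^2)
round(k_periodic(c(0, 0.25, 0.5, 1)), 3)   # 1.000 0.368 0.135 1.000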

Matérn Kernels

\[k_{\text{Matérn-3/2}}(\bs{x}, \bs{x}') = \sigma_f^2\left(1 + \frac{\sqrt{3}\tau}{\ell}\right)\exp\left(-\frac{\sqrt{3}\tau}{\ell}\right)\]

\[k_{\text{Matérn-5/2}}(\bs{x}, \bs{x}') = \sigma_f^2\left(1 + \frac{\sqrt{5}\tau}{\ell} + \frac{5\tau^2}{3\ell^2}\right)\exp\left(-\frac{\sqrt{5}\tau}{\ell}\right)\]

where \(\tau = \|\bs{x} - \bs{x}'\|\) as before.

Unlike RBF: Finite differentiability

  • Matérn-3/2: once differentiable
  • Matérn-5/2: twice differentiable

(You don’t need to memorize these formulas!)
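
For reference, a direct transcription of the Matérn-3/2 formula as a function of the distance \(\tau\) (a sketch; the helper name is ours):

k_mat32 <- function(tau, ell = 1, sigma_f = 1) {
  r <- sqrt(3) * tau / ell
  sigma_f^2 * (1 + r) * exp(-r)   # once-differentiable sample paths
}
round(k_mat32(c(0, 1, 3)), 3)   # 1.000 0.483 0.034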

Matérn Kernels

Samples drawn using a Matérn kernel are often “rougher” and can be more realistic.

Combining Kernels

If \(k_1\) and \(k_2\) are valid kernels, so are

\[\begin{align*} k_{\text{sum}}(\bs{x}, \bs{x}') &= k_1(\bs{x}, \bs{x}') + k_2(\bs{x}, \bs{x}') && \text{(superposition)}\\ k_{\text{prod}}(\bs{x}, \bs{x}') &= k_1(\bs{x}, \bs{x}') \times k_2(\bs{x}, \bs{x}') && \text{(modulation)} \end{align*}\]

Addition is like two independent phenomena occurring at once.

  • \(k_{RBF} + k_{\text{periodic}}\): smooth trend + rotation
  • Like hearing a melody + rhythm


Multiplication is like one pattern modulating another.

  • \(k_{RBF} \times k_{\text{periodic}}\): periodic signal with evolving magnitude
  • Like a volume knob controlling a repetitive signal

Since the set of kernels is closed under these algebraic operations, kernels can be tailored to the specific problem of interest.

Kernel Code

library(greta)
library(greta.gp)

# Evaluate a periodic kernel on a grid of test locations
x_test <- seq(1, 10, .5)
k_per <- periodic(period = 2 * pi, lengthscale = 1, variance = 1)
K <- calculate(k_per(x_test))

# Heatmap of the resulting covariance matrix
image(K[[1]], asp = 1)

Visualize Samples

# Define a GP over x_test with the periodic kernel, then simulate two draws
f <- gp(x_test, kernel = k_per)
draws <- calculate(f, nsim = 2)

# Plot the first simulated function
plot(x_test, draws[[1]][1, , 1])

Kernel Algebra

Kernel objects can be added and multiplied.

# Sum kernel: periodic component plus a smooth RBF trend
new_kernel <- k_per + rbf(lengthscales = 1, variance = 1)
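
A possible follow-up, reusing the gp() and calculate() pattern from above, is to sample from the combined kernel:

# Sample two functions from a GP with the combined kernel
f_sum <- gp(x_test, kernel = new_kernel)
draws_sum <- calculate(f_sum, nsim = 2)
plot(x_test, draws_sum[[1]][1, , 1])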

Exercise

Respond to [Interactive Kernel Choice] parts (a) and (b) in the exercise sheet.